Add hi-IN , Ko-KR and pt-BR IPA tokenizer support by quapham · Pull Request #15567 · NVIDIA-NeMo/NeMo

quapham · 2026-03-31T10:22:25Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Extends the IpaG2p tokenizer to support Hindi (hi-IN) with English code-switching, Korean (ko-KR), and Brazilian Portuguese (pt-BR) locales.

Collection: [Note which collection this PR will affect]
tts, common

Changelog

Add hi-IN, ko-KR and pt-BR to SUPPORTED_LOCALES in ipa_lexicon.py
Add INDIC_CHARS_ALL support to tokenizer_utils.py and IpaG2p (enables Indic script dict parsing)
Extend IpaG2p._parse_phoneme_dict() to accept a list of dicts enabling multi-dict code-switching (e.g. Hindi + English)
Add pronunciation dict files: hi_prondict-v0.1.dict (hi-IN), ko_prondict-v1.0.dict (ko-KR), and pt_br_prondict-v1.0.dict (pt-BR)
Add unit tests: test_ipa_tokenizer_hi_in (hi-IN/en code-switching), test_ipa_ko_kr and test_ipa_tokenizer_pt_br

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this
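
The following is a minimal, self-contained sketch of the code-switching idea from the changelog, not the NeMo implementation. It assumes the Hindi and English dictionaries have disjoint keys (Devanagari versus Latin script), so a plain dict union is enough. The dictionary entries, `LEXICON`, and `phonemize` are hypothetical names and illustrative data, and the grapheme fallback for unknown words is an assumption.

```python
from typing import Dict, List

PronDict = Dict[str, List[List[str]]]

# Illustrative entries only; they are not taken from the dict files added in this PR.
HI_DICT: PronDict = {"नमस्ते": [["n", "ə", "m", "ə", "s", "t", "eː"]]}
EN_DICT: PronDict = {"hello": [["h", "ə", "ˈ", "l", "o", "ʊ"]]}

# Assumption: keys never collide across scripts, so a simple union is sufficient here.
LEXICON: PronDict = {**HI_DICT, **EN_DICT}


def phonemize(text: str, lexicon: PronDict) -> List[str]:
    """Map each whitespace-separated word to its first pronunciation.

    Words missing from the lexicon are kept as graphemes (an assumption of this sketch).
    """
    out: List[str] = []
    for i, word in enumerate(text.split()):
        if i > 0:
            out.append(" ")  # keep word boundaries as a space token
        prons = lexicon.get(word.lower())
        out.extend(prons[0] if prons else list(word))
    return out


print(phonemize("नमस्ते hello", LEXICON))
```

In the PR itself, the equivalent step would presumably be passing a list of dictionaries as `phoneme_dict` to `IpaG2p`, as the Hindi unit test does.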

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

[x] Make sure you read and followed Contributor guidelines
[x] Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

blisc · 2026-04-16T19:59:57Z

Can you fix the linting and sign off issues?

Copilot

Pull request overview

Extends the NeMo TTS IPA G2P/tokenizer stack to better handle additional scripts/locales (Hindi with English code-switching, Korean, and Brazilian Portuguese) by expanding tokenization character coverage, dictionary parsing, and adding unit tests to validate expected IPA outputs.

Changes:

Added unit tests for IPA tokenization in pt-BR, hi-IN (Hindi/English code-switching), and ko-KR.
Expanded “any-locale” tokenization character coverage to include Indic and Korean Unicode ranges.
Updated IpaG2p dictionary parsing and regex handling to accept Indic/Korean words and merge multiple dictionaries.

Reviewed changes

Copilot reviewed 4 out of 7 changed files in this pull request and generated 3 comments.

File	Description
tests/collections/common/tokenizers/text_to_speech/test_tts_tokenizers.py	Adds unit tests and small in-test pronunciation dictionaries for pt-BR, hi-IN code-switching, and ko-KR.
nemo/collections/tts/g2p/models/i18n_ipa.py	Extends IpaG2p regex + dictionary parsing to support Indic/Korean and multi-dict merging for code-switching.
nemo/collections/common/tokenizers/text_to_speech/tokenizer_utils.py	Adds Indic and Korean Unicode ranges and expands any-locale word tokenization regex accordingly.
nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py	Adds `pt-BR` and `ko-KR` to supported locales and extends punctuation sets for hi-IN and ko-KR.
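
To illustrate what expanding the word tokenization regex with Indic and Korean ranges could look like, here is a small sketch. It assumes the Devanagari block (U+0900 to U+097F) and the precomposed Hangul syllables (U+AC00 to U+D7A3); the actual constants in `tokenizer_utils.py` may cover different or wider ranges, and `WORD_RE` and `split_words` are hypothetical names.

```python
import re

# Assumed ranges for illustration only.
LATIN = "A-Za-z"
DEVANAGARI = "\u0900-\u097F"        # includes vowel signs and the virama
HANGUL_SYLLABLES = "\uAC00-\uD7A3"  # precomposed Hangul syllables

# A "word" is a maximal run of characters from any of the ranges above.
WORD_RE = re.compile(f"[{LATIN}{DEVANAGARI}{HANGUL_SYLLABLES}]+")


def split_words(text: str) -> list:
    """Return the word-like runs found in text, dropping punctuation and spaces."""
    return WORD_RE.findall(text)


print(split_words("नमस्ते world, 안녕하세요!"))
```

Note that the Devanagari block also contains the danda (U+0964), so a real tokenizer would likely need to treat that character as punctuation rather than as part of a word.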

Copilot · 2026-04-20T17:43:52Z

    def __init__(
        self,
-        phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]],
+        # phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]],
+        phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]],
        locale: str = "en-US",


The phoneme_dict type annotation doesn't match the supported runtime behavior: the Hindi unit test passes a list of dicts for code-switching, but the annotation only allows List[Union[str, Path]]. This will trip static type checking and makes the API contract unclear; broaden the union to allow lists containing dicts (or use a Sequence[...]) and update the parameter docstring accordingly (also remove the stale commented-out type line).

I agree with Copilot's comment. We need to remove the stale commented-out type line and fix the typing. This appears in three places: __init__, _parse_phoneme_dict, and replace_dict.

The type List[Union[str, pathlib.Path]] doesn't reflect the actual runtime behavior. The Hindi test passes [self.PHONEME_DICT_HI, self.PHONEME_DICT_EN], which is a list of dicts. The recursive call in _parse_phoneme_dict handles this correctly at runtime, but the type annotation is misleading.

Suggested change

-    def __init__(
-        self,
-        phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]],
-        # phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]],
-        phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]],
-        locale: str = "en-US",
+    def __init__(
+        self,
+        phoneme_dict: Union[
+            str, pathlib.Path, List[Union[str, pathlib.Path, Dict[str, List[List[str]]]]], Dict[str, List[List[str]]]
+        ],
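
Since the same long union is repeated in `__init__`, `_parse_phoneme_dict`, and `replace_dict`, one option is a shared type alias. The sketch below is only a suggestion and not part of the PR; `PronDict`, `DictSource`, `PhonemeDictArg`, and `as_source_list` are hypothetical names.

```python
import pathlib
from typing import Dict, List, Union

# word -> list of pronunciations, each pronunciation a list of IPA symbols
PronDict = Dict[str, List[List[str]]]

# A single source: a file path or an already parsed dictionary
DictSource = Union[str, pathlib.Path, PronDict]

# Accepted argument: one source, or a list of sources for code-switching
PhonemeDictArg = Union[DictSource, List[DictSource]]


def as_source_list(phoneme_dict: PhonemeDictArg) -> List[DictSource]:
    """Normalize the argument so callers can always iterate over sources."""
    if isinstance(phoneme_dict, list):
        return list(phoneme_dict)
    return [phoneme_dict]
```

With an alias like this, the three signatures would stay in sync and a list of dicts (as in the Hindi test) would type-check.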

XuesongYang · 2026-04-20T23:48:05Z

    @staticmethod
    def _parse_phoneme_dict(
-        phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]]
+        phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]]


Suggested change

-        phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]]
+        phoneme_dict: Union[
+            str, pathlib.Path, List[Union[str, pathlib.Path, Dict[str, List[List[str]]]]], Dict[str, List[List[str]]]
+        ],

XuesongYang · 2026-04-20T23:51:20Z


-    def replace_dict(self, phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]]):
+    def replace_dict(
+        self, phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]]


Suggested change

-        self, phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]]
+        self,
+        phoneme_dict: Union[
+            str, pathlib.Path, List[Union[str, pathlib.Path, Dict[str, List[List[str]]]]], Dict[str, List[List[str]]]
+        ],

XuesongYang · 2026-04-20T23:52:40Z

    ) -> Dict[str, List[List[str]]]:
        """
-        parse an input IPA dictionary and save it as a dict object.
+        parse an input IPA dictionary (or multiple) and save it as a dict object.


Suggested change

-        parse an input IPA dictionary (or multiple) and save it as a dict object.
+        Parse one or more IPA dictionaries and return a merged dict object.

XuesongYang · 2026-04-20T23:54:45Z

Suggested change

        Args:
            phoneme_dict: A single phoneme dictionary source or a list of sources for multi-dictionary
                code-switching (e.g. Hindi + English). Each source can be:
                - a file path (str or pathlib.Path) in CMUdict format,
                  e.g. ``scripts/tts_dataset_files/ipa_cmudict-0.7b_nv22.06.txt``
                - a dict object with CMUdict-like entries,
                  e.g. ``{"Wire": [["ˈ", "w", "a", "ɪ", "ɚ"], ["ˈ", "w", "a", "ɪ", "ɹ"]]}``
                When a list is provided, all sources are parsed and merged into a single dictionary.
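
The behavior this docstring describes (a path, a dict, or a list of either, merged into one dictionary) can be sketched as a small recursive function. This is not the code in `i18n_ipa.py`. It assumes a simplified file format of `WORD<whitespace>IPA-string` with one character per symbol and `;;;` comment lines, and it assumes that a word found in several sources keeps all distinct pronunciations. `parse_sources` is a hypothetical name.

```python
import pathlib
import tempfile
from typing import Dict, List, Union

PronDict = Dict[str, List[List[str]]]
Source = Union[str, pathlib.Path, PronDict]


def parse_sources(phoneme_dict: Union[Source, List[Source]]) -> PronDict:
    """Parse one source, or recursively parse and merge a list of sources."""
    if isinstance(phoneme_dict, list):
        merged: PronDict = {}
        for source in phoneme_dict:
            for word, prons in parse_sources(source).items():
                bucket = merged.setdefault(word, [])
                for pron in prons:
                    if pron not in bucket:  # assumption: skip duplicate pronunciations
                        bucket.append(pron)
        return merged

    if isinstance(phoneme_dict, dict):
        # Copy so the caller's dictionary is never mutated.
        return {word: [list(p) for p in prons] for word, prons in phoneme_dict.items()}

    parsed: PronDict = {}
    with open(phoneme_dict, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";;;"):
                continue  # skip blank lines and comments
            word, pron = line.split(maxsplit=1)
            parsed.setdefault(word, []).append(list(pron))  # one char per symbol (assumed)
    return parsed


# Demo: merge a file-based English dictionary with an in-memory Hindi one.
with tempfile.TemporaryDirectory() as tmp:
    en_path = pathlib.Path(tmp) / "en.dict"
    en_path.write_text(";;; toy dictionary\nhello  həˈloʊ\n", encoding="utf-8")
    merged = parse_sources([en_path, {"नमस्ते": [["n", "ə", "m", "ə", "s", "t", "eː"]]}])

print(sorted(merged))
```

Handling the list case by recursion keeps the single-source paths unchanged, which matches the review note that the recursive call already works for a list of dicts at runtime.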

XuesongYang

@quapham I made some suggestions on the PR. Please apply them if you feel they are correct. Thanks!

FYI, I changed the unit tests directly so that they cover more cases.

github-actions Bot added TTS common labels Mar 31, 2026

quapham changed the title ~~Add hi-IN and pt-BR IPA tokenizer support~~ Add hi-IN , Ko-KR and pt-BR IPA tokenizer support Apr 16, 2026

blisc added the Run CICD label Apr 16, 2026

XuesongYang requested review from XuesongYang and Copilot April 20, 2026 17:38

Copilot started reviewing on behalf of XuesongYang April 20, 2026 17:39

Copilot AI reviewed Apr 20, 2026

chtruong814 added Run CICD and removed Run CICD labels Apr 20, 2026

XuesongYang added the skip-linting label Apr 20, 2026

quapham and others added 8 commits April 20, 2026 12:04

feat(tts): extend IPA tokenizer with hi-IN/en code-switching and pt-BR

37fc7d8

feat(tts): remove ar-MSA locale as out of scope for this PR

ea93a40

Signed-off-by: quanpham <youngkwan199@gmail.com>

Apply isort and black reformatting

2845815

Signed-off-by: quapham <quapham@users.noreply.github.com>

Add Korean IPA support

9bf8a0b

Signed-off-by: quanpham <youngkwan199@gmail.com>

Fix leftover merge markers in Korean IPA support

4bc23b7

Signed-off-by: quanpham <youngkwan199@gmail.com>

Apply isort and black reformatting

683d5be

Signed-off-by: quapham <quapham@users.noreply.github.com>

fix: add KOREAN_CHARS import

f0bbf22

Signed-off-by: quanpham <youngkwan199@gmail.com>

Apply suggestion from @Copilot

2151454

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>

XuesongYang force-pushed the hi_pt_BR_ipa branch from 065c5cb to 2151454 April 20, 2026 19:04

chtruong814 added Run CICD and removed Run CICD labels Apr 20, 2026

chtruong814 temporarily deployed to test April 20, 2026 19:05 — with GitHub Actions Inactive

XuesongYang reviewed Apr 20, 2026

Comment thread nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py Outdated

Apply suggestion from @XuesongYang

fa43171

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>

XuesongYang reviewed Apr 21, 2026

Comment thread tests/collections/common/tokenizers/text_to_speech/test_tts_tokenizers.py

XuesongYang requested a review from blisc April 21, 2026 00:40

XuesongYang requested changes Apr 21, 2026

Update tests/collections/common/tokenizers/text_to_speech/test_tts_to…

7ba7767

…kenizers.py Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>